Scale-free topology of e-mail networks 
We study the topology of e-mail networks with e-mail addresses as nodes and e-mails as links 
using data from server log files. The resulting network exhibits a scale-free link distribution and 
pronounced small-world behavior, as observed in other social networks. These observations imply 
that the spreading of e-mail viruses is greatly facilitated in real e-mail networks compared to random 
architectures. 

PACS numbers: 89.20.Hh, 89.75.Hc, 05.65.+b 



Complex networks such as the World Wide Web or social
networks often do not have an engineered architecture
but instead are self-organized by the actions of a large
number of individuals. From these local interactions,
non-trivial global phenomena can emerge, for example
small-world properties or a scale-free distribution of
the degree. These global properties have considerable
implications for the behavior of the network under error
or attack, as well as for the spreading of information
or epidemics. Here we report that networks composed
of persons connected by exchanged e-mails show the
characteristics of both small-world networks and
scale-free networks.

The nodes of an e-mail network correspond to e-mail
addresses, which are connected by a link if an e-mail has
been exchanged between them. The network studied here
is constructed from log files of the e-mail server at Kiel
University, recording the source and destination of every
e-mail from or to a student account over a period of 112
days. The resulting network consists of N = 59,912
nodes (including 5,165 student accounts) with a mean
degree of $\langle k \rangle = 2.88$ and contains several separated
clusters with fewer than 150 nodes and one giant component
of 56,969 nodes (mean degree $\langle k_{\mathrm{large}} \rangle = 2.96$).
The degree distribution n(k), i.e. the distribution of the
number k of a node's nearest neighbors, obeys a power law

$$ n(k) \propto k^{-1.81} \qquad (1) $$

with exponential cut-off (Fig. 1).
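
The construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the server log has already been parsed into (sender, recipient) pairs, and the function names and toy addresses are hypothetical. Repeated e-mails between the same two addresses count as a single undirected link.

```python
from collections import Counter, defaultdict

def build_undirected_network(log_records):
    """Return {address: set of neighbor addresses} from (sender, recipient) pairs."""
    neighbors = defaultdict(set)
    for sender, recipient in log_records:
        if sender == recipient:
            continue  # self-addressed mail is ignored (an assumption of this sketch)
        neighbors[sender].add(recipient)
        neighbors[recipient].add(sender)
    return dict(neighbors)

def degree_distribution(neighbors):
    """Return {k: number of nodes with degree k}."""
    return dict(Counter(len(adj) for adj in neighbors.values()))

def mean_degree(neighbors):
    """Average number of distinct neighbors per node."""
    return sum(len(adj) for adj in neighbors.values()) / len(neighbors)

# Toy log with hypothetical addresses.
toy_log = [("a@x", "b@x"), ("b@x", "a@x"), ("a@x", "c@y"),
           ("a@x", "d@y"), ("d@y", "c@y")]
network = build_undirected_network(toy_log)
print(degree_distribution(network))
print(mean_degree(network))
```

Storing neighbors as sets makes the degree of a node the number of distinct correspondents, which matches the definition of k used in the text.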

Most of the scaling exponents reported so far for the
degree distributions of computer and social networks lie
in the range of -2.0 to -3.4. One exception is the
social network of co-authorships in high energy physics,
for which Newman found an exceptionally small scaling
exponent of -1.2. Similar to our work are studies of
networks of phone calls made during one day. These
phone-call networks show scale-free behavior of the
degree distribution as well, with an exponent of -2.1.

FIG. 1: Degree distribution of the e-mail network. The
double-logarithmic plot of the number of e-mail addresses
with which a node exchanged e-mails exhibits a power law
with exponent -1.81 ± 0.10 over two decades. This
distribution is used to calculate estimates for the
clustering coefficient and the average shortest path
length for the entire network (see text).

Let us briefly discuss how our result on e-mail networks
may be influenced by the measurement process.
The sampling of the network has been restricted to one
distinct e-mail server. Therefore, only the degrees of
accounts at this server are known exactly. Here, these
internal accounts correspond to e-mail addresses of local
students, whereas the external nodes are given by all
other e-mail addresses. We resolve the degree distribution
of internal accounts only (Fig. 2), and find that it
can be approximated by a power law $n_{\mathrm{int}}(k) \propto k^{-1.32}$ as
well (mean degree $\langle k_{\mathrm{int}} \rangle = 14.86$). Since the degrees of
external nodes typically are underestimated, this exponent
is smaller (in magnitude) than for the whole network. For the same
reason, there are fewer nodes with small degree in the
distribution of students' degrees. Note that the cut-off
of both distributions is about the same. Therefore,
external sources addressing almost all internal nodes (e.g.
advertisement or spam) do not bias the degree statistics.
Thus, it can be concluded that the e-mail network
exhibits scale-free behavior.
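
The text does not state how the exponents were fitted. One simple possibility, shown here only as a sketch, is a least-squares line through the double-logarithmic plot, restricted to a range of degrees below the cut-off. The function name and the `k_min`/`k_max` parameters are hypothetical; maximum-likelihood estimators are often preferred for real data.

```python
import math

def fit_power_law_exponent(distribution, k_min=1, k_max=None):
    """Least-squares slope of log n(k) versus log k for k_min <= k <= k_max.

    `distribution` maps degree k to the number of nodes n(k). The returned
    slope is the (negative) scaling exponent, e.g. about -1.81.
    """
    points = [(math.log(k), math.log(n))
              for k, n in distribution.items()
              if n > 0 and k >= k_min and (k_max is None or k <= k_max)]
    if len(points) < 2:
        raise ValueError("need at least two data points to fit a slope")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Synthetic check: an exact power law with exponent -1.81 over two decades.
synthetic = {k: 1e6 * k ** -1.81 for k in range(1, 101)}
print(fit_power_law_exponent(synthetic))
```

Choosing `k_max` below the exponential cut-off keeps the cut-off region from steepening the fitted slope.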

FIG. 2: Degree distribution of the student accounts in the
e-mail network. The degree distribution of the subset of
student e-mail addresses with completely known degree can be
approximated by a power law as well, with exponent -1.32 ±
0.18. This exponent is smaller than for the whole network
since the degree of external nodes is underestimated by the
measurement.

Furthermore, the e-mail network shows the properties
of a "small world", i.e. a high probability that two
neighbors of one node are connected themselves (clustering)
and a small average length $\ell$ of the shortest path
between two nodes. The clustering is measured by the
clustering coefficient C of a network, which is defined in
the following way: The clustering coefficient $C_v$ of a node
v is given by the ratio of the number of existing links $E_v$
between its $k_v$ first neighbors to the potential number of
such ties, $\frac{1}{2} k_v (k_v - 1)$. By averaging $C_v$ over all
nodes one arrives at the clustering coefficient C of the network:

$$ C = \langle C_v \rangle_v = \left\langle \frac{2 E_v}{k_v (k_v - 1)} \right\rangle_v \qquad (2) $$
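
The definition of the clustering coefficient given above translates directly into code. The following is a minimal sketch on a symmetric adjacency dictionary. The source does not say how nodes with fewer than two neighbors are treated, so setting their $C_v$ to zero is an assumption of this example.

```python
def local_clustering(neighbors, v):
    """C_v = E_v / (k_v (k_v - 1) / 2) for node v of an undirected network."""
    adj = neighbors[v]
    k = len(adj)
    if k < 2:
        return 0.0  # convention assumed here; not specified in the text
    # Each link between two neighbors of v is seen from both ends, hence // 2.
    links = sum(1 for u in adj for w in neighbors[u] if w in adj) // 2
    return 2.0 * links / (k * (k - 1))

def clustering_coefficient(neighbors):
    """Average of C_v over all nodes."""
    return sum(local_clustering(neighbors, v) for v in neighbors) / len(neighbors)

# Toy network: a triangle a-b-c with one extra node d attached to a.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(clustering_coefficient(toy))
```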

To verify the small-world properties, experimental results
are compared to the respective values derived for
random networks, as well as for networks with links
assigned randomly but obeying the same degree distribution.
We will call a network a "random network" if the
probability p that an arbitrary pair of nodes is connected
is constant, $p = \langle k \rangle/(N - 1)$, leading to a Poissonian
degree distribution. In this case, the clustering coefficient is
just this probability, $C_{\mathrm{rand}} = p$. Additionally, we deduced
an estimate $\tilde{C}$ for an upper bound of the clustering
coefficient of a network with identical degree distribution
but randomly assigned links. Hence, $\tilde{C}$ gives an upper
bound of the clustering that is expected from the degree
distribution alone. Employing the generating function
approach for networks with arbitrary degree distributions
and assuming that fluctuations of the mean degree of
a node's neighborhood can be ignored results in:

$$ \tilde{C} = \frac{\left(\langle k^2 \rangle - \langle k \rangle\right)^2}{N \langle k \rangle^3} \qquad (3) $$

This equation exactly applies in the case of the Poissonian
degree distribution of a random network ($C_{\mathrm{rand}} = \tilde{C}$).
Next we compare the experimental average shortest
path length $\ell$ with the respective value $\ell'$ of a network
with the same degree distribution and randomly assigned
links. With the generating function approach we obtain

$$ \ell' = \frac{\ln\left(N / \langle k \rangle\right)}{\ln\left[\left(\langle k^2 \rangle - \langle k \rangle\right) / \langle k \rangle\right]} + 1 \qquad (4) $$

For the Poissonian distribution of a random network,
Eqn. (4) simplifies to:

$$ \ell_{\mathrm{rand}} = \frac{\ln N}{\ln \langle k \rangle} \qquad (5) $$

Note that only N, $\langle k \rangle$ and $\langle k^2 \rangle$ are used for the estimates
$\ell'$ and $\ell_{\mathrm{rand}}$, as in the derivation of Eqn. (3).
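
Since the estimates depend only on N, $\langle k \rangle$ and $\langle k^2 \rangle$, they are easy to evaluate numerically. The sketch below assumes the standard generating-function expressions for uncorrelated random graphs with a prescribed degree distribution, namely $\tilde{C} = (\langle k^2 \rangle - \langle k \rangle)^2 / (N \langle k \rangle^3)$ and $\ell' = \ln(N/\langle k \rangle) / \ln[(\langle k^2 \rangle - \langle k \rangle)/\langle k \rangle] + 1$. The function and key names are hypothetical.

```python
import math

def random_graph_estimates(n_nodes, mean_k, mean_k2):
    """Estimates for clustering and path length from N, <k> and <k^2>.

    Assumes the standard expressions for uncorrelated random graphs;
    requires <k^2> - <k> > <k> > 1 so that the logarithms are positive.
    """
    z1 = mean_k             # mean number of first neighbors
    z2 = mean_k2 - mean_k   # mean number of second neighbors
    return {
        "C_rand": mean_k / (n_nodes - 1),
        "C_tilde": z2 ** 2 / (n_nodes * mean_k ** 3),
        "l_prime": math.log(n_nodes / z1) / math.log(z2 / z1) + 1.0,
        "l_rand": math.log(n_nodes) / math.log(mean_k),
    }

# Poissonian check: <k^2> = <k>^2 + <k>, so l_prime and l_rand must coincide.
poisson = random_graph_estimates(1000, 4.0, 20.0)
print(poisson["l_prime"], poisson["l_rand"])

# Values quoted in the text (<k^2> is not reported, so it is set to the
# Poissonian value here and only the C_rand and l_rand entries are meaningful).
whole = random_graph_estimates(59912, 2.88, 2.88 ** 2 + 2.88)
giant = random_graph_estimates(56969, 2.96, 2.96 ** 2 + 2.96)
print(whole["C_rand"], giant["l_rand"])
```

With the node count and mean degree of the giant component, the random-network path length evaluates to roughly 10.09, and the whole-network values give a random clustering coefficient of about 4.8 × 10^-5, both close to the figures reported in the text.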

In the following, the experimental values are compared
to the above estimates. The experimental clustering
coefficient C = 0.156 is about one order of magnitude larger than one
would expect solely from the degree distribution ($\tilde{C} = 0.0187$).
For a random network, the respective clustering
coefficient is $C_{\mathrm{rand}} = 4.82 \times 10^{-5}$.

The mean shortest path length in the giant component
was determined to be $\ell = 4.95 \pm 0.03$ with the Dijkstra
algorithm. It is larger than the mean shortest path
length in a network with the same degree distribution
($\ell' = 3.43$) since more links are consumed for forming
local clusters. It is still smaller than the path length
of a random network, $\ell_{\mathrm{rand}} = 10.10$ (where each pair of
nodes is connected with a constant probability leading to
the same mean degree), because of the highly connected
"hubs" present in a scale-free network.

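
The mean shortest path length of a component can be measured as sketched below. The text names the Dijkstra algorithm. Because all links here have unit length, a breadth-first search yields the same distances, and that simpler variant is what this illustrative example uses. The function names are hypothetical.

```python
from collections import deque

def bfs_distances(neighbors, source):
    """Hop distances from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in neighbors[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def mean_shortest_path(neighbors):
    """Average distance over all ordered pairs of distinct, connected nodes."""
    total = 0
    pairs = 0
    for source in neighbors:
        dist = bfs_distances(neighbors, source)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs

# Toy network: a chain a - b - c - d.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(mean_shortest_path(chain))
```

Running one search per node costs on the order of N(N + E) steps, so for a component with tens of thousands of nodes one may prefer to average over a random sample of source nodes.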
FIG. 3: In-degree distributions for the e-mail network. The
double-logarithmic plot of the in-degree distributions for all
nodes (filled diamonds, $\langle k_{\mathrm{in}} \rangle = 1.62$) and for student nodes
only (open diamonds, $\langle k_{\mathrm{in}} \rangle_{\mathrm{int}} = 13.06$) shows a power-law
distribution with an exponent of -1.49 ± 0.18. Note that
again the in-degree of external nodes is underestimated by
the measurement process.

To further investigate the emergence of the scale-free
degree distribution, we study the e-mail network as a
directed graph, where an e-mail corresponds to a directed
link pointing from the sender to the receiver. Although
the e-mail network has to be treated as an undirected
graph in the context of virus spreading (see below), it
seems reasonable that the sending and receiving of
e-mails are governed by different processes. Again, the
analysis is done for the distributions of all nodes and of
internal nodes only, where for the latter the in- and
out-degree can be determined exactly. The distributions of the
in-degree i, i.e. the number of different nodes a node has
received e-mails from, are very similar for all nodes and
for internal nodes (Fig. 3). They can both
be approximated by a power law $n(i) \propto i^{-1.49}$ over about
two orders of magnitude. Deviations between the two
distributions for in-degrees i < 6 are due to the underestimation
of the degree of external nodes. One explanation for an
in-degree exponent of about -1.5 is the assumption of
stochastic multiplicative growth as in the model of
Huberman and Adamic. They proposed that the
number of links a node receives at a time step is a random
fraction of the number of links it has already received.

The treatment of the out-degree is more difficult. For
the whole network, the distribution of the out-degree j, i.e. a
node's number of links to other nodes, shows pronounced
scale-free behavior, $n(j) \propto j^{-2.03}$ (Fig. 4). However, the
corresponding distribution for internal nodes is broad but
does not show scale-free behavior over a sufficient range.
This may be caused by the limited size of the sample
but may also point to a systematic error caused by
students possibly using different (external) accounts for
sending e-mails. The out-degree scaling exponent of the
whole network lies in a quite common range for
communication and social networks such as the movie actors'
network or the phone-call network, where the principle
of preferential attachment can be used for modeling.
It rests on the assumption that the probability $p_j$
that a link originates in the set of nodes with out-degree
j is proportional to the number of links already starting
in this set:

$$ p_j \propto j \, n(j) \qquad (6) $$

This corresponds to Simon's general model for such copy
and growth processes. Let us briefly apply this
model to the e-mail network. From our data we estimated
the ratio of the growth rate of nodes to the growth rate
of links to be $\alpha = 0.597$ nodes per link, which is sufficient to
calculate the scaling exponent $\gamma$:

$$ \gamma = 1 + \frac{1}{1 - \alpha} \approx 3.48 \qquad (7) $$

Thus, the preferential linking model leads to a steep
exponent of -3.48, not in accordance with observation. On
the other hand, a model based only on transitive linking,
i.e. on the assumption that two nodes are more
likely to be linked if they have a common neighbor, can
in principle reproduce the small-world properties and a
broad degree distribution but leads to too high a clustering
and does not yield a power-law degree distribution for
this particular network. From this we conjecture that
including both preferential and transitive linking may
consistently model the e-mail network.
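
The directed quantities used in this analysis can be sketched as follows. The example assumes the log is available as (sender, recipient) pairs, and it assumes the relation $\gamma = 1 + 1/(1 - \alpha)$ for the exponent of Simon's model, which reproduces the quoted value of 3.48 for $\alpha = 0.597$. All names and addresses are hypothetical.

```python
from collections import defaultdict

def directed_degrees(log_records):
    """Return (in_degree, out_degree) dictionaries counting distinct partners."""
    senders_of = defaultdict(set)      # recipient -> distinct senders
    recipients_of = defaultdict(set)   # sender -> distinct recipients
    for sender, recipient in log_records:
        recipients_of[sender].add(recipient)
        senders_of[recipient].add(sender)
    nodes = set(senders_of) | set(recipients_of)
    in_degree = {v: len(senders_of.get(v, ())) for v in nodes}
    out_degree = {v: len(recipients_of.get(v, ())) for v in nodes}
    return in_degree, out_degree

def simon_exponent(alpha):
    """Scaling exponent for a node-to-link growth ratio alpha (assumed relation)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    return 1.0 + 1.0 / (1.0 - alpha)

# Toy log with hypothetical addresses.
toy_log = [("a@x", "b@x"), ("b@x", "a@x"), ("a@x", "c@y"),
           ("a@x", "d@y"), ("d@y", "c@y")]
in_deg, out_deg = directed_degrees(toy_log)
print(in_deg, out_deg)
print(simon_exponent(0.597))
```

Every directed link contributes once to an in-degree and once to an out-degree, so the mean in-degree and mean out-degree taken over all nodes are equal, in line with the common value of 1.62 given in the captions of Figs. 3 and 4.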

FIG. 4: Out-degree distributions for the e-mail network. For
the out-degree, only the distribution of internal and external
nodes (filled diamonds, $\langle k_{\mathrm{out}} \rangle = 1.62$) exhibits a pronounced
power law over two decades with exponent -2.03 ± 0.12. The
distribution of the out-degree of internal nodes (open
diamonds, $\langle k_{\mathrm{out}} \rangle_{\mathrm{int}} = 12.39$) is broad as well but cannot be
identified with a scale-free regime, which may be due to the
limited size of the sample.



What are the implications of the above results for the
spreading of e-mail viruses? The occurrence of e-mail
viruses has become a well-known phenomenon in today's
communication experience. An e-mail virus or e-mail
worm is a program attached to an e-mail which, when
opened by the recipient, causes the recipient's e-mail
program to send numerous infected e-mails to e-mail
addresses found in the address book or in stored e-mails.
Hence, for the propagation of e-mail viruses the network
is undirected. This is different for chain e-mails, where
each recipient is asked to forward the chain e-mail to
other addresses. E-mail viruses can cause serious damage
to computer networks by destroying data on infected
computers or by overloading e-mail servers and other
infrastructure. In May 2000, for instance, the e-mail worm
"I love you" infected more than 500,000 individual systems
worldwide and obstructed 21% of the computer
workplaces in Germany.

In scale-free networks, the threshold for the propagation
rate above which an infection of the network spreads
and becomes persistent is very much lower than in other
disordered networks, or even vanishes. This means
that the self-organized structure of the e-mail network
facilitates the spreading of computer viruses, as well as of
any other information. In addition, the e-mail network is
quite robust in case of "failures" of random nodes if, for
instance, some participant does not answer e-mails for a
while or uses anti-virus software. However, it is sensitive
to the loss of highly connected participants because
of the power-law degree statistics. Hence, uniformly
applied immunization of nodes is unlikely to eradicate
infections until almost all participants are immunized,
whereas targeting prevention efforts at the highly
connected sites significantly suppresses epidemic outbreaks
and prevalence.
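
Both points can be illustrated with a short sketch. It is not taken from the paper. It assumes the heterogeneous mean-field estimate $\lambda_c = \langle k \rangle / \langle k^2 \rangle$ for the epidemic threshold of an uncorrelated network, which becomes small when a few hubs make $\langle k^2 \rangle$ large, and it compares removing the most connected node with removing an arbitrary low-degree node. All names are hypothetical.

```python
from collections import deque

def epidemic_threshold(degrees):
    """Mean-field threshold estimate <k> / <k^2> (an assumption of this sketch)."""
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(k * k for k in degrees) / n
    return mean_k / mean_k2

def largest_component_size(neighbors, removed=()):
    """Size of the largest connected component after removing some nodes."""
    removed = set(removed)
    seen = set()
    best = 0
    for start in neighbors:
        if start in removed or start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for w in neighbors[u]:
                if w not in removed and w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best

def top_hubs(neighbors, count):
    """The `count` nodes of highest degree (ties broken arbitrarily)."""
    return sorted(neighbors, key=lambda v: len(neighbors[v]), reverse=True)[:count]

# Toy network: one hub connected to six otherwise isolated leaves.
leaves = ["l%d" % i for i in range(6)]
star = {"hub": set(leaves)}
star.update({leaf: {"hub"} for leaf in leaves})

print(epidemic_threshold([len(adj) for adj in star.values()]))
print(largest_component_size(star, removed=top_hubs(star, 1)))  # targeted
print(largest_component_size(star, removed=["l0"]))             # non-hub node
```

In this toy network, immunizing the single hub fragments the network completely, whereas immunizing a leaf leaves the rest connected.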

These observations suggest helpful and advantageous
applications, but also point to the inherent dangers of
e-mail networks. The security of e-mail communication
can be improved by identifying highly connected hub
addresses and monitoring them for viruses more strictly,
e.g. in corporate e-mail networks, to prevent the damaging
and costly spreading of e-mail viruses. In a different
application, making use of the high clustering,
commercial e-mail providers can identify communities of users
more easily and focus marketing more efficiently. In
general, communication by e-mail can be interfered with
as well as utilized more extensively due to the non-trivial
topological features of the e-mail network that we found
here. Exploring the web of e-mails not only extends
our knowledge of social and communication networks but
also shows how vulnerable and exploitable these
systems can be.

In conclusion, we have shown that an e-mail network,
where nodes are given by e-mail addresses and links by
exchanged messages, exhibits both small-world properties
and scale-free behavior. The e-mail network is studied
in terms of undirected as well as directed networks.
Spreading of e-mail viruses is considered, based on the
appropriate viewpoint of an undirected graph. The
scale-free nature of the e-mail network strongly facilitates
persistence and propagation of e-mail viruses but also points
to effective countermeasures.